The UN data set available in R is used for this analysis. It provides information on national health, welfare and population statistics for 213 locations, mainly UN members. Different hypothesis are tested to gain insights into the differences in the groups and regions.
fertility: total fertility rate, number of children per woman
ppgdp: Gross domestic product per capita in US dollars
lifeExpF: female life expectancy in years
pctUrban: urban population in %
infantMortality: Infant mortality at the age of 1 year per 1000 live births
groups: OECD, Africa, Other
region: Africa, Asia, Caribbean, Europe, Latin Amer, North America, NorthAtlantic, Oceania
## region group fertility ppgdp
## Africa :52 oecd : 31 Min. :1.134 Min. : 114.8
## Asia :50 other :110 1st Qu.:1.750 1st Qu.: 1239.8
## Europe :39 africa: 52 Median :2.264 Median : 4495.8
## Latin Amer:20 Mean :2.780 Mean : 12291.1
## Oceania :17 3rd Qu.:3.700 3rd Qu.: 14497.3
## Caribbean :13 Max. :6.925 Max. :105095.4
## (Other) : 2
## lifeExpF pctUrban infantMortality
## Min. :48.11 Min. : 11.0 Min. : 1.916
## 1st Qu.:65.10 1st Qu.: 39.0 1st Qu.: 7.243
## Median :75.57 Median : 59.0 Median : 19.637
## Mean :72.08 Mean : 57.1 Mean : 30.739
## 3rd Qu.:79.07 3rd Qu.: 75.0 3rd Qu.: 45.892
## Max. :87.12 Max. :100.0 Max. :124.535
##
Scatter plots for a basic understanding of the data
pairs(UN_data[, c("fertility", "ppgdp", "lifeExpF", "pctUrban", "infantMortality")],
lower.panel = NULL) H1: There are significant differences between Africa and other country groups in terms of life expectancy and fertility rate.
H2: Life expectancy of women in African countries is significantly lower than in other country groups and in OECD countries.
mean_overview_group <- UN_data %>%
select(group, fertility, ppgdp, lifeExpF, pctUrban, infantMortality) %>%
group_by(group) %>%
mutate(N = n()) %>%
summarise_all(mean)
print(mean_overview_group)## # A tibble: 3 Ă— 7
## group fertility ppgdp lifeExpF pctUrban infantMortality N
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 oecd 1.77 37761. 82.4 75.8 4.89 31
## 2 other 2.36 9819. 75.1 58.8 21.7 110
## 3 africa 4.27 2337. 59.4 42.4 65.3 52
sd_overview_group <- UN_data %>%
select(group, fertility, ppgdp, lifeExpF, pctUrban, infantMortality) %>%
group_by(group) %>%
summarise_all(sd)
print(sd_overview_group)## # A tibble: 3 Ă— 6
## group fertility ppgdp lifeExpF pctUrban infantMortality
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 oecd 0.340 22092. 2.09 11.7 3.50
## 2 other 0.941 12502. 5.66 23.4 17.5
## 3 africa 1.29 3424. 8.39 17.7 27.5
Examinnation of the relationships between the variables using visualizations.
ggplot(UN_data, aes(x = group, y = lifeExpF, color = group)) +
geom_boxplot() +
labs(
x = "Group",
y = "Life Expectancy of women (years)"
)+
scale_color_manual(values = c("#00AFBB", "#E7B800", "#FC4E07"))+
theme_minimal()+
guides(color = FALSE)ggplot(UN_data, aes(x = group, y = fertility, color = group)) +
geom_boxplot() +
labs(
x = "Group",
y = "Fertility rate (number of children per woman)"
) +
scale_color_manual(values = c("#00AFBB", "#E7B800", "#FC4E07"))+
theme_minimal()+
guides(color = FALSE)Checking the prerequisites for an ANOVA
Normal distribution of the residuals
Homogeneity of variance (Homoscedasticity)
Testing the residuals for normal distribution
The QQ plot compares the empirical quantiles of the residuals with the theoretical quantiles of a normal distribution. A good fit is indicated by an approximately straight-line relationship between the empirical and theoretical quantiles.
#linear mode
model_fertility <- lm(fertility ~ group, data = UN_data)
#Q-Q-Plot
ggqqplot(residuals(model_fertility))#model
model_lifeExpF <- lm(lifeExpF ~ group, data = UN_data)
#Q-Q-Plot
ggqqplot(residuals(model_lifeExpF))The residuals of both variables do not seem to be clearly normally distributed. As a parametric method, one-way ANOVA provides the best interpretable results if the dependent variable is approximately normally distributed in each group.
Testing for homoscedasticity
leveneTest(data= UN_data, lifeExpF ~ group)
# p=1.516e-06 < 0.05 there is no homogeneity of variance
leveneTest(data= UN_data, fertility ~ group)
# p= 4.358e-06 <0.05 there is no homogeneity of varianceVariance homogeneity is an important prerequisite for ANOVA, as it ensures that the variance of the dependent variables are approximately the same across the groups. If this assumption is violated, the ANOVA results may be unreliable and lead to erroneous conclusions.
The Residuals vs. Fitted diagram shows the residuals against the predicted values. A random pattern without systematic trends indicates homoscedasticity. The data points in the Residuals vs. Fitted Diagram in both cases almost form a triangle shape, it indicates heteroscedasticity, which means that the variance of the residuals is not constant across the range of fitted values.
Since the requirements for a one-way ANOVA are not sufficiently met, the hypothesis will be tested with the non-parametric Kruskal-Wallis-Test instead.
Kruskal-Wallis-Test
##
## Kruskal-Wallis rank sum test
##
## data: fertility by group
## Kruskal-Wallis chi-squared = 83.377, df = 2, p-value < 2.2e-16
##
## Kruskal-Wallis rank sum test
##
## data: lifeExpF by group
## Kruskal-Wallis chi-squared = 115.11, df = 2, p-value < 2.2e-16
The test result includes the chi-square value, the degree of freedom and the p-value. A significant p-value indicates that at least one group differs significantly from the other groups.
Based on p-value < 2.2e-16 <0.05, it can be concluded that there is a statistically significant difference between the birth rates by country group.
Based on p-value < 2.2e-16 <0.05, it can be concluded that there is a statistically significant difference between life expectancy by country group.
post-hoc pairwise comparisons
## Kruskal-Wallis rank sum test
##
## data: x and group
## Kruskal-Wallis chi-squared = 83.3766, df = 2, p-value = 0
##
##
## Comparison of x by group
## (Bonferroni)
## Col Mean-|
## Row Mean | africa oecd
## ---------+----------------------
## oecd | 8.369592
## | 0.0000*
## |
## other | 7.368734 -3.241147
## | 0.0000* 0.0018*
##
## alpha = 0.05
## Reject Ho if p <= alpha/2
## Kruskal-Wallis rank sum test
##
## data: x and group
## Kruskal-Wallis chi-squared = 115.1114, df = 2, p-value = 0
##
##
## Comparison of x by group
## (Bonferroni)
## Col Mean-|
## Row Mean | africa oecd
## ---------+----------------------
## oecd | -10.44108
## | 0.0000*
## |
## other | -7.390998 5.534292
## | 0.0000* 0.0000*
##
## alpha = 0.05
## Reject Ho if p <= alpha/2
If the p-value for a pairwise comparison is smaller than the significance level divided by the number of comparisons (alpha/2 in this case), then the null hypothesis will be rejected. The null hypothesis states that there is no difference between the groups being compared. The Bonferroni correction is used to reduce the probability of wrong results. All pairwise comparisons indicate significant differences between African countries and OECD countries and others in terms of fertility rate and life expectancy. Furthermore,the negative values in the table indicate that the mean life expectancy of women in African countries is lower than in the OECD and other countries.
H3: There are significant differences between the regions Asia and Europe in terms of the urbanisation rates.
mean_overview_region <- UN_data %>%
select(region, pctUrban) %>%
group_by(region) %>%
mutate(N = n()) %>%
summarise_all(mean)
print(mean_overview_region)## # A tibble: 7 Ă— 3
## region pctUrban N
## <fct> <dbl> <dbl>
## 1 Africa 42.4 52
## 2 Asia 57.4 50
## 3 Caribbean 57.8 13
## 4 Europe 70.5 39
## 5 Latin Amer 70 20
## 6 North America 82 2
## 7 Oceania 52.2 17
sd_overview_region <- UN_data %>%
select(region, pctUrban) %>%
group_by(region) %>%
summarise_all(sd)
print(sd_overview_region)## # A tibble: 7 Ă— 2
## region pctUrban
## <fct> <dbl>
## 1 Africa 17.7
## 2 Asia 25.1
## 3 Caribbean 25.1
## 4 Europe 13.1
## 5 Latin Amer 16.9
## 6 North America 1.41
## 7 Oceania 28.3
Examinnation of the relationships between the variables using a boxplot.
ggplot(UN_data, aes(x = region, y = pctUrban, color = region)) +
geom_boxplot() +
labs(
x = "Region",
y = "Urbanisation rate in %"
)+
scale_color_manual(values = c("#c0c0c0", "#FC4E07", "#c0c0c0", "#E7B800", "#c0c0c0", "#c0c0c0", "#c0c0c0"))+
theme_minimal()+
guides(color = FALSE)Checking the prerequisites for an ANOVA
Normal distribution of the residuals
Homogeneity of variance (Homoscedasticity)
Testing the residuals for normal distribution
#shapiro-Wilk-Test
residuals_model <- shapiro.test(resid(model_urban))
print(residuals_model)
#p-value = 0.2188 >0.05 indicates normal distributionThe residuals of the dependent variable seems to be approximately normally distributed.
Testing for Homoscedasticity
leveneTest(data = UN_data, pctUrban ~ region)
# p=1.378e-05 < 0.05 Homogeneity of variance is not assumedVariance homogeneity can be assumed (even though the levene test had a different result) because the data points are relatively evenly distributed. Since the requirements for a one-way ANOVA, normal distribution of the residuals and variance homogeneity are met, the one-way ANOVA will be used to test the hypothesis.
one-way ANOVA
The p-value 1.222e-08 indicates that there are significant differences between the regions, which means that at least one of the regions differs statistically significantly from the others.
post-hoc pairwise comparisons
UN_data %>%
pairwise_t_test(pctUrban~region,
pool.sd = TRUE,
p.adjust.method = "bonferroni") %>%
as.data.frame()Calculating the effect size How big are the differences between Africa and Asia in terms of the urbanisation rate?
subset_data <- subset(UN_data, region %in% c("Africa", "Asia"))
#cohens d comparison between Africa and Asia
cohen.d(pctUrban ~ region, data = subset_data)##
## Cohen's d
##
## d estimate: -0.6927299 (medium)
## 95 percent confidence interval:
## lower upper
## -1.0973001 -0.2881598
Cohen’s d measures the standard deviation of the differences between the groups in relation to the mean standard deviation. A value of -0.6927299 indicates that there is a medium effect.